The deciphering of the neural code is one of the last grand mysteries of physiology. Using methods such as calcium imaging or electrophysiology to record activity of genetically defined neurons, researchers around the globe hope to gain critical insight into how information is coded and processed by the nervous system.
The aim of this project is to apply techniques of statistical machine learning to real-world neuronal activity data and help describe how sensory stimuli are represented in the murine auditory cortex.
This project is part of the PSTAT 231 course (instructed by Dr. Katie Coburn). Since the overall objective of this project is to learn how to use the tools presented during class, we will try to fit as many of the models covered in class as possible.
Brain tissue is organized into functionally and anatomically distinct regions. These specialized areas often perform very specific tasks, such as handling sensory information or calculating appropriate motor commands. Traditionally, however, the only detectable output other than the behavioral or physiological change itself is the variation in the electrical properties of single neurons residing at a given site. The link between neuronal activity and its global effect is still poorly understood; since this link is key to understanding how information is coded and processed, it is the subject of intense research.
The classical approach to studying brain areas handling sensory information is based on observing responses of single neurons to various changing parameters (Theunissen and Elie 2014). The collected data is subsequently translatable into neuron-specific receptive fields (RFs), the study of which has revealed tonal organization of neurons both in the lower parts of the auditory pathway (mouse dorsal inferior colliculus (Barnstedt et al. 2015)) and in the auditory cortex (Issa et al. 2014). The available data might suggest that the processing of the auditory input is dependent on single neurons encoding the perception of the input component that provokes them to maximal activity (Frégnac and Bathellier 2015). Furthermore, this hypothesis is in accord with the “neuronal doctrine” (Barlow 1972), which suggests such coding in other parts of the brain as well.
Although such linear coding of single input components is supported by data from lower levels of the auditory pathway, neurons from higher levels express significant non-linearity in their response patterns. For example, certain neurons specifically respond to unique physical attributes of natural sounds rather than being strictly frequency-tuned (Theunissen and Elie 2014). These non-linear responses are context-dependent (Kato, Gillet, and Isaacson 2015), thus opposing the classic, exclusively feed-forward model and requiring the inclusion of input from local neuronal populations.
In light of growing evidence, it became apparent that the application of the classical receptive field concept in the auditory cortex should be revisited and possible alternatives investigated. Another important factor was the advent of novel techniques capable of recording large, genetically defined neural populations, namely two-photon imaging of genetically encoded calcium indicators (Lütcke and Helmchen 2011; Chen et al. 2013), which has enabled in vivo testing of alternative models. Unlike the classical electrophysiological approach, these techniques are not limited to recordings of a small number of neurons, a limitation that strongly shaped the direction of scientific effort in the past.
The search for an alternative brought attention to a model already proposed for the hippocampal navigational apparatus. Here, specialized cell types such as head direction cells or place cells encode spatial navigation, and their activity is hypothesized to be organized following an attractor network model (Zhang 1996; Redish 1999). The attractor network model suggests that a specifically stimulated neuron excites cells of similar tuning (highly interconnected with it), while the lack of activation results in the inhibition of cells responsive to different values of the given variable (due to a global inhibition network) (Redish 1999).
The activity of local cortical networks (of span ~200 µm) was discovered to be limited to very few reciprocally competitive response modes (mostly 1-3), while the shift between them is abrupt (Bathellier, Ushakova, and Rumpel 2012a), thus expressing attractor-like dynamics. Such discrete shifting between few possible modes suggests that the attractor-like dynamics could form a base for the encoding of perceptual categories (Wang 2008). Moreover, the role of the auditory cortex has recently been hypothesized to lie within a more complex processing of the auditory input than previously thought, namely representing sounds as auditory objects (Russ, Lee, and Cohen 2007). Since the sensory categorization offers a balanced coding strategy between detail acquisition and generalization across experiences (Seger and Miller 2010), it reasonably fits within the current context of the auditory cortex.
In summary, the investigation of discrete network dynamics and categorization in the mouse auditory cortex by (Bathellier, Ushakova, and Rumpel 2012b) has produced a significant amount of useful information. Among these is evidence supporting the idea that it is not single neurons but hundreds of cells which form a functional unit in the neocortex (Averbeck, Latham, and Pouget 2006). Moreover, it was shown that single subnetworks are comprised of a redundant number of cells and are non-discretely divided, possibly to counter the high overall noise. However, most importantly, it was shown that different response modes (of a local neuronal network) recorded by calcium imaging reflect the categorization of sounds observed also behaviorally, thus suggesting that the attractor final states may be directly representing the perceptual category.
In their experiments, Bathellier et al. have used a small organic dye, Oregon Green BAPTA-1 (OGB-1), which was the state-of-the-art calcium indicator at the time. Unfortunately, the use of OGB-1 is restricted to acute experiments only.
In this study, our aim was to reproduce the experiments performed by Bathellier et al. using a genetically encoded calcium indicator GCaMP7f (Dana et al. 2019), one of the first protein-based sensors matching the performance of OGB-1, and take advantage of the possibility to generate chronic preparations. Using this protocol, we set out to describe the temporal stability of cortical dynamics, namely that of cortical representations of complex auditory stimuli.
Note:
An important feature of neuronal activity in primary sensory cortices, including the auditory cortex, is that it is sparse and exhibits very high trial-to-trial variability (Hromádka, DeWeese, and Zador 2008).
Mice were implanted with chronic cranial windows over their right auditory cortex (fig.1). Expression of the genetically encoded calcium indicator GCaMP7f (Addgene catalog number: 104488-AAV9) in the underlying tissue was achieved using microinjection (Nanoject III, Drummond Inc.).
fig.1, Example of a mouse implanted with a chronic cranial window over its right auditory cortex
After three weeks of recovery, mice were checked for expression (fig.2) and entered the imaging protocol.
fig.2, Chronic cranial window over mouse auditory cortex. Bright spots indicate GCaMP7f expression
During the first day of imaging (day 0), multiple fields of view (FOV) (fig.3) were subjected to 10 repeats of a battery of 39 auditory stimuli (awake imaging). These include pure tones, complex sounds, and natural stimuli mimicking physiologically relevant auditory percepts. The stimulation battery was kindly provided by Dr. Bathellier.
fig.3, An example FOV of neurons expressing GCaMP in the auditory cortex of a mouse
During stimulation, changes in fluorescence of the calcium indicator were recorded using a two-photon microscope (Ultima IV, Prairie View) running ScanImage (Vidrio Technologies).
Time bins of 250 ms after the presentation of each stimulus (39 randomized stimuli per trial, 1 s interstimulus intervals, 10 trials per session) were analyzed using the in-house developed TwoPhotonProcessor software (Tomek, Novak, and Syka 2013). Spike probabilities were extracted from changes in fluorescence caused by action potential-evoked calcium transients using the foopsi algorithm and compiled in MATLAB into matrix form (fig.4).
fig.4, Calcium transients evoked by action potentials and reported by calcium indicators can be used as a proxy for neuronal activity. Rows: neurons, Columns: trials
We revisited the same FOV one day after the first session and subsequently every five days until day +30. The same battery of stimuli was presented to the same neurons under matching conditions. Fig.5 provides an example of how the activity pattern elicited by a single sound (sound 6) changes between sessions (session dates (yyyymmdd) are shown above the columns).
The resulting dataset is comprised of 15 FOVs from 11 animals and as such is imported into R.
fig.5, Activity of neurons elicited by sound 6 in a single FOV revisited multiple times. Rows: neurons, Columns: trials, rectangles: Sessions
The aim of this project is to analyze the above described data using methods of statistical machine learning.
Our goal is to answer the following questions:
Can we detect a consistently stable representation of a sensory stimulus in the inherently noisy and variable calcium imaging data using unsupervised learning?
Is it possible to train a machine learning model to classify whether a specific trial was or was not elicited by a sound represented in the recorded neuronal population?
Which model performs the best?
Some of the functional analysis had to be performed in MATLAB since the Two Photon Processor (Tomek, Novak, and Syka 2013)–which is the framework used for raw fluorescence trace analysis–does not exist outside MATLAB. Consequently, data in the form of a MATLAB structured list is imported and transformed into a tibble dataframe.
The data used in this analysis will be that collected for field of view f34c, totaling 3120 observation points of spike probability (effectively a measure of neuronal activation). The data is organized with the following variables:
day: a factor with levels 0 to 7 to encode from which session the data was collected (note that 0 to 7 correspond to days 0, +1, +5, +10, +15, +20, +25, and +30, respectively)
sound: a factor with levels 1 to 39, each level representing a particular sound stimulus
V1 through V34: the spike probability of each neuron in the FOV (Vk corresponds to the k-th neuron)
Additional analysis on the data for the other 14 FOVs remains a viable potential extension of this project.
library(R.matlab)
library(readr)
library(tidyverse)
library(data.table)
library(cluster)
library(dendextend)
library(factoextra)
library(ROCR)
library(boot)
library(caret)
library(glmnet)
library(tree)
library(maptree)
library(randomForest)
library(gbm)
library(e1071)
library(reshape2)
#initialize df, otherwise using the superassignment operator will throw an error (since df is also a built-in function)
df <- NULL
#function for loading/reloading our data
load_matlab_data <- function(all_fov = FALSE, scramble_sound = FALSE) {
dataMat <<- R.matlab::readMat("sortAudFile.mat", fixNames=T)
#data of the particular FOV we will use
f34c <<-dataMat$sortAudFile[5]
#if specified by user, will load the data of the rest of the FOVs
if (all_fov) {
f19<<-dataMat$sortAudFile[1]
f33<<-dataMat$sortAudFile[2]
f34a<<-dataMat$sortAudFile[3]
f34b <<-dataMat$sortAudFile[4]
f35 <<-dataMat$sortAudFile[6]
f37 <<-dataMat$sortAudFile[7]
f38 <<-dataMat$sortAudFile[8]
f40 <<-dataMat$sortAudFile[9]
f41a <<-dataMat$sortAudFile[10]
f41b <<-dataMat$sortAudFile[11]
f42 <<-dataMat$sortAudFile[12]
f43a <<-dataMat$sortAudFile[13]
f43b <<-dataMat$sortAudFile[14]
f44 <<-dataMat$sortAudFile[15]
}
#initialize data tibble as the 10 trials x 34 neuron responses from day 0, sound stimulus 1
df<<-as_tibble(f34c[[1]][[1]][[1]])
df <<- df %>% add_column(sound = 1)
df <<- df %>% add_column(day = 0)
tempF <- f34c[[1]]
#read in the rest of the data for days and sounds stimuli
for (j in c(1:8)){
for (i in c(1:39)) {
if (i+j != 2){
df <<- df %>% add_row(tibble(as_tibble(tempF[i][[1]][[j]]), sound = i,day = j-1))
}
}
}
#if specified by user, randomly scramble sound identity (with replacement)
if(scramble_sound){
set.seed(132)
for (i in 1:length(df$sound)){
df$sound[i] <<- sample(x = c(1:39), size=1, replace = TRUE)
}
}
df$sound <<- factor(df$sound)
df$day <<- factor(df$day)
}
load_matlab_data()
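Before summarizing the responses, a quick sanity check of the assembled tibble is useful (a sketch; the expected counts follow from 8 sessions x 39 sounds x 10 trials):

#sanity checks on the assembled tibble (illustrative)
dim(df)           #expect 3120 rows, 36 columns (34 neurons + sound + day)
table(df$day)     #expect 390 observations per session (39 sounds x 10 trials)
table(df$sound)   #expect 80 observations per sound (8 sessions x 10 trials)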
sumVect_sound <- c(1:39)
for (i in c(1:39)) {
data_by_sound <- df %>% filter(sound == i)
temp.df <- as.matrix(data_by_sound %>% select(V1:V34))
sumVect_sound[i] = sum(colSums(temp.df))
names(sumVect_sound)[i] <- paste("sound ", i)
}
barplot(sumVect_sound/80, names.arg = names(sumVect_sound), las = 2, cex.names = 0.5, col = 2, main = paste("Bar graph of response averages by sound"), ylab="spike probability average")
From the bar graph above it is clear that a few sounds (namely sounds 26, 33, and 34) on average tend to elicit a greater response. It is possible that these sounds will be influential in our unsupervised model fitting, although if there is extreme inconsistency in the strength of response or in the particular neuron that is responding, we may not be able to create a model that aligns with our theoretical framework.
To do an introductory investigation of our neuron variables (V1 through V34), we can plot a histogram of their average spike probabilities:
avgVect_neuron <- df %>% select(V1:V34) %>% colMeans(na.rm=TRUE)
hist(x=avgVect_neuron, breaks = 10, labels = TRUE, col = 3, main = paste("Histogram of average spike probability across all trials"), ylab="count", xlab = "average spike probability")
avgVect_neuron %>%
sort(decreasing = TRUE) %>%
.[1:2] %>%
names() %>%
paste(collapse = " and ") %>%
paste("are our most responsive neurons")
## [1] "V7 and V33 are our most responsive neurons"
Note that the majority of neurons have a low average spike probability, with the exception of V7 and V33. We can’t preemptively conclude which neurons will or will not be significant predictors based on this knowledge alone, but we can dig deeper and look at the five-number summary of each neuron’s responses to get a better sense of how they compare to one another.
#reshape data
long <- melt(df)
ggplot(data=long, aes(variable, value)) + geom_boxplot()
It is easy to identify a few very extreme, lone observations near the top of the graph. Although we cannot simply exclude these from our dataset (cherry-picking is bad practice), we will disregard them in the visualization below.
#reshape data
long <- melt(df)
ggplot(data=long, aes(variable, value)) +
geom_boxplot() +
scale_y_continuous(limits = c(0, 10))
Zooming in, we can see that all of our boxplots are compressed against the x axis, meaning that for each neuron the vast majority of values are simply zero, and any non-zero value is flagged as an outlier. Immediately, we get a sense of the sparsity of “informative” data, which could be dangerous if not handled carefully.
Neuron specificity (detailed in the introduction) would suggest that the theoretical distribution of responses would not be identically distributed, and perhaps not independently distributed either, considering the proposition of the attractor model. As the last step of our exploratory data analysis, we’ll compute a correlation matrix of our neuron variables (not displayed for readability).
#correlation matrix
cor_matrix <- df %>% select(V1:V34) %>% cor()
#print out highly correlated variable pairs at the specified threshold
for (i in 1:nrow(cor_matrix)){
correlations <- which((cor_matrix[i,] >= 0.4) & (cor_matrix[i,] != 1))
if(length(correlations)> 0){
print(colnames(cor_matrix)[i])
print(correlations)
}
}
## [1] "V7"
## V33
## 33
## [1] "V33"
## V7
## 7
None of the neurons were found to have a correlation of higher than 0.75. The most highly correlated pair is V7 and V33, with a correlation of 0.4017.
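For a compact visual overview of the full matrix, a base-R heatmap can be drawn (a sketch, purely for inspection; it is not part of the modeling pipeline):

#heatmap of the neuron-neuron correlation matrix
heatmap(cor_matrix, symm = TRUE, main = "Neuron-neuron correlations")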
The hypothesis proposed by the work inspiring these experiments postulates that local neuronal populations represent sensory percepts by categorizing them into a small number of perceptual categories (Bathellier, Ushakova, and Rumpel 2012b). Since our aim was to reproduce these results using tools enabling long-term imaging, we first try to identify sounds producing a recognizable activity pattern.
In order to do this, we employ an unsupervised learning method, hierarchical clustering.
During the experiments themselves, fields of view were selected for long-term studies by performing clustering online during imaging on the first day of recordings (day 0).
In this analysis, we will loop through all of the days, performing hierarchical clustering on each day’s data separately. We chose this approach for the following reasons:
Each day, the neuronal population had to be revisited and relabeled, which may introduce added variability; each session can therefore be regarded as a different condition.
Previously published work only tested stability within one imaging session, so within-session stability is what we expect to see.
Based on previously published work, we expect to see a large general cluster containing most of the sounds that consistently fail to elicit a response. We are interested in the second, much smaller cluster consisting of a very small number of sounds (1-3) that do elicit a consistent response.
We loop through all of the imaging sessions (day 0, +1, +5, +10, +15, +20, +25 and +30) and perform hierarchical clustering of response patterns within each day. Finally, we look at which sound was clustered into the second cluster most often.
We limit the number of clusters to 2 as we do not expect to see more than 2 modes of response. One being no response, the other being an activity pattern elicited by one or a small number of stimuli.
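This two-mode assumption can also be probed empirically. As a sketch (df.day0 is a hypothetical helper name, and factoextra's hcut uses a different default linkage than the average linkage applied below), the average silhouette width across candidate cluster counts for the day 0 responses could be plotted as follows:

#average silhouette width for k = 2..6 on the day 0 responses (factoextra/cluster are loaded above)
df.day0 <- df %>% filter(day == 0) %>% select(V1:V34)
fviz_nbclust(df.day0, hcut, method = "silhouette", k.max = 6)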
#repeat data reloading as a precautionary step in case some prior manipulation was introduced and not corrected.
load_matlab_data()
#initialize empty variables...
#(we later find sound 33 to be of interest, so we will collect extra information on it here)
#table with counts of how many sound 33 observations were assigned to each cluster, for each hclust run
table.accum <- matrix(, nrow=2,ncol=8)
rownames(table.accum) <- rownames(table.accum, do.NULL = FALSE, prefix = "cluster ")
colnames(table.accum) <- colnames(table.accum, do.NULL = FALSE, prefix = "hclust ")
#matrix of which sound was most commonly assigned to cluster 2, for each hclust run
maxSecond <- matrix(, ncol=8)
rownames(maxSecond) <- "sound"
colnames(maxSecond) <- colnames(maxSecond, do.NULL = FALSE, prefix = "hclust ")
#perform hierarchical clustering for each individual day
for (k in c(0:7)){
df.Day <- df %>% filter(df$day==k)
df.neurons <- df.Day %>% select(V1:V34)
dist_mat <- dist(df.neurons, method = "euclidean")
hclust_avg <- hclust(dist_mat, method = "average")
cut_avg <- cutree(hclust_avg, k = 2)
table.accum[,k+1] <- table(cut_avg, df.Day$sound)[,33]
maxSecond[k+1] <- as.numeric(which.max(table(cut_avg, df.Day$sound)[2,]))
}
#example dendrogram for day 0 (re-filter, since df.neurons still holds the last day's data after the loop)
df.neurons <- df %>% filter(day == 0) %>% select(V1:V34)
dist_mat <- dist(df.neurons, method = "euclidean")
hclust_avg <- hclust(dist_mat, method = "average")
## dendrogram: branches colored by 2 groups
dend1 = as.dendrogram(hclust_avg)
# color branches and labels by 2 clusters
dend1 = color_branches(dend1, k=2)
dend1 = color_labels(dend1, k=2)
# change label size
dend1 = set(dend1, "labels_cex", 0.3)
plot(dend1,horiz=T, main = "Example two-cluster dendrogram for Day 0")
Below are the IDs of the stimuli that were most abundantly clustered into the second mode within each of the 8 sessions:
#Display which sound was most frequently assigned to cluster 2 in each session
maxSecond
## hclust 1 hclust 2 hclust 3 hclust 4 hclust 5 hclust 6 hclust 7 hclust 8
## sound 34 38 36 33 34 26 33 33
Sound number 33 is the most abundant sound eliciting a response pattern falling into cluster 2. The counts for the sound 33 cluster assignments are:
#Display the counts of sound 33 assignments to each cluster, per session
table.accum
## hclust 1 hclust 2 hclust 3 hclust 4 hclust 5 hclust 6 hclust 7
## cluster 1 10 10 10 8 9 10 9
## cluster 2 0 0 0 2 1 0 1
## hclust 8
## cluster 1 7
## cluster 2 3
See 5.2 for further discussion of these results.
In order to assess the correlation structure among our variables (neurons V1 to V34) and the effective dimensionality of the data, we perform principal component analysis.
#reload original (non-scrambled) data
load_matlab_data()
## Warning: The `x` argument of `as_tibble.matrix()` must have unique column names if `.name_repair` is omitted as of tibble 2.0.0.
## Using compatibility `.name_repair`.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was generated.
#get subset of data (only day 0)
df.Day <- df %>% filter(day == 0)
df.neurons <- df.Day %>% select(V1:V34)
pca.neurons <- prcomp(df.neurons, scale=TRUE, center = TRUE)
pr.var <- (pca.neurons$sdev)^2
pve <- pr.var/sum(pr.var)
The graphical representation of the proportion of variance explained by single principal components:
plot(pve, xlab="Principal Component",
ylab="Proportion of Variance Explained ", ylim=c(0,1),type='b')
We can observe that the principal components contribute approximately equally to the explained variance, with the exception of one component that explains roughly twice as much variation as any other.
For that particularly important component (PC1), we can investigate which of the neurons have the highest absolute loadings:
#The rotation matrix provides the principal component loadings
#each column of pca.neurons$rotation contains the corresponding principal component loading vector
pc1_loadings = pca.neurons$rotation[,1]
head(sort(abs(pc1_loadings), decreasing = TRUE))
## V13 V10 V25 V21 V6 V24
## 0.2835726 0.2532116 0.2460992 0.2441526 0.2308034 0.2283170
The graphical representation of the cumulative proportion of variance explained by single principal components:
plot(cumsum(pve), xlab="Principal Component ",
ylab=" Cumulative Proportion of Variance Explained ", ylim=c(0,1), type='b')
abline(v =min(which((cumsum(pve)>=0.9))))
We observe that in order to explain 90% of the total variation in the data, we need at least 27 principal components. This high number suggests we indeed need all of our variables to capture the variance, so we should use all of them in our subsequent analyses.
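For reference, factoextra (loaded earlier) can reproduce the per-component scree plot shown above in a single line (a sketch):

fviz_eig(pca.neurons, ncp = 34, addlabels = FALSE)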
Using methods of unsupervised learning, namely hierarchical clustering, we were able to determine the presence of a consistently represented sound of interest in one of the FOVs.
The FOV in question is f34c while the sound selected based on the number of times it was clustered into the non-general cluster is sound 33.
Coming back to the first question of our experimental aim: Can we detect a consistently stable representation of a sensory stimulus in the inherently noisy and variable calcium imaging data using unsupervised learning?
The conclusion is that yes: we were able to detect a specific sound that elicited a discernible population response detectable in multiple sessions. The consistency may seem low (see 5.1.1.1), but as it is the most reliable across-session predictor of population activity, this sound was chosen as a valid candidate for further analysis.
The low consistency is likely caused by a combination of factors including:
high trial-to-trial variability of responses inherent to the neocortex
changing conditions of recording (changes in expression levels)
changes in the state of the animal (laboratory mouse strains are known to become deaf fairly quickly)
In this part, we aim to answer the second of our experimental questions: Is it possible to train a machine learning model to classify whether a specific trial was or was not elicited by a sound represented in the recorded neuronal population?
In the following paragraphs, we will try to train, tune, and evaluate multiple classes of models we covered during class.
First, we need to create a factor variable defining whether the observation was obtained as a response to the examined sound. Based on the hierarchical clustering performed earlier, this stimulus is sound 33.
load_matlab_data()
df = df %>%
mutate(s33=as.factor(ifelse(sound == "33", "Yes","No")))
Since only one sound is considered a positive case, our data contains a small fraction of positive cases for training purposes (1/39). For that reason, it is important to perform stratified sampling, so that the limited number of positive labels is split proportionally between the training and testing data sets.
#stratified
set.seed(123)
df <- df %>% mutate(id = row_number())
#Check IDs
head(df$id)
## [1] 1 2 3 4 5 6
train.index <- createDataPartition(df$s33, p = .7, list = FALSE)
train <- df[ train.index,]
test <- df[-train.index,]
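A quick check (sketch) confirms that stratification preserved the roughly 1/39 (~2.6%) positive-class proportion in both splits:

#positive-label proportion should be similar in both splits
prop.table(table(train$s33))
prop.table(table(test$s33))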
The choice of model is determined by the data. Since the task at hand is to train a model to classify whether a population response was or was not elicited by sound 33, we will start with logistic regression.
The dataset needs to be further formatted in order to satisfy the input requirements by the glm() function. What is presented to the model is the activity of neurons (variables V1:V34) and a factor variable of levels Yes/No determining whether the observation was recorded in response to sound 33 or not.
#generate subdatasets of only activity data for learning purposes
excl <- c('sound','day','id')
train.cv <- train %>% select(-all_of(excl))
test.cv <- test %>% select(-all_of(excl))
We proceed with fitting the data with our model.
model.glm = glm(formula = s33~., data=train.cv, family = binomial)
#summary(model.glm)
calc_error_rate <- function(predicted.value, true.value){
return(mean(true.value!=predicted.value)) }
Minor formatting of the prediction output is needed to obtain a training and testing error
prob.training.glm = NULL
prob.training.glm = as.factor(ifelse(predict(model.glm, type="response")>0.5,"Yes","No"))
trainError<- calc_error_rate(prob.training.glm,train$s33)
paste("Training error: ", trainError)
## [1] "Training error: 0.0238095238095238"
#compute predictions on the held-out test set
prob.test.glm = as.factor(ifelse(predict(model.glm, newdata=test.cv, type="response")>0.5,"Yes","No"))
testError<- calc_error_rate(prob.test.glm,test$s33)
paste("Test error :", testError)
## [1] "Test error : 0.0343406593406593"
According to these rates, the model incorrectly classifies a given observation in ~3% of the cases. The testing error is slightly higher than the training error, which can be expected due to the fact that none of the data in the test split were available for training.
However, despite these seemingly low error rates, one point of concern is that our model may be performing well simply by consistently predicting that any given observation does not belong to sound 33. We can imagine a model designed to predict ‘No’ for every observation; given our full dataset, it would predict correctly for the data belonging to 38 of the 39 sounds, and incorrectly only for sound 33, which accounts for 2.6% of all our observations.
With this in mind, the overall error rate is not by itself the entire story. More information can be gained by breaking down the error into its different components, as done below.
Another useful method for assessing and visualizing the performance of a classifier is the Receiver Operating Characteristic (ROC) curve plotting the True positive rate against the False positive rate.
prob.training.glm = NULL
prob.training.glm =predict(model.glm)
pred.glm = prediction(prob.training.glm, train.cv$s33)
perf.glm = performance(pred.glm, measure="tpr", x.measure="fpr")
plot(perf.glm, col=2, lwd=3, main="ROC curve")
abline(0,1)
Area under the curve (AUC) is a useful metric summarizing the performance of the classifier across all decision thresholds of the ROC curve. Ranging from 0 to 1, it can be interpreted as the probability that the model ranks a randomly chosen positive observation above a randomly chosen negative one.
auc.glm = performance(pred.glm, "auc")@y.values
paste("Area under the curve: ", auc.glm)
## [1] "Area under the curve: 0.88423905746509"
Results:
The performance of the model as assessed by the area under the curve (AUC) metric seems fairly good; however, since the positive label is so sparse in our data and most of the correctly predicted cases belong to the ‘Not sound 33’ category, we should investigate additional ways of evaluating performance.
In order to solidify our results, it is useful to perform cross-validation, meaning we retrain and evaluate our model on multiple randomly selected subsets of the data. The aim of this approach is to mitigate the danger of skewed results caused by a particular split of the data and to provide a statistically more robust estimate.
cost <- function(r, pi = 0) mean(abs(r-pi) > 0.5)
model.glm.cv<-cv.glm(test.cv,model.glm,cost,10)
The two values below are the raw 10-fold cross-validation estimate of the prediction error and its bias-adjusted counterpart:
model.glm.cv$delta[1]
## [1] 0.04166667
model.glm.cv$delta[2]
## [1] 0.02793138
10-fold cross-validation has produced similar error rates to our previous estimates. However, this does not necessarily solve the problem of having too few positive cases (i.e., cases where an observation belongs to sound 33): the metric may still be heavily skewed by the fact that correctly classifying a non-sound 33 observation also counts as a correct prediction.
Thus, it is informative to investigate whether the accuracy of classification extends to all levels of the factor variable s33.
Constructing a confusion matrix should provide the most insight into how well the model can classify whether an observation was or was not elicited by sound 33.
prob.training.glm = as.factor(ifelse(predict(model.glm, type="response")>0.5,"Yes","No"))
confusionMatrix(data = prob.training.glm, reference = train.cv$s33)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 2120 44
## Yes 8 12
##
## Accuracy : 0.9762
## 95% CI : (0.9689, 0.9822)
## No Information Rate : 0.9744
## P-Value [Acc > NIR] : 0.3238
##
## Kappa : 0.3064
##
## Mcnemar's Test P-Value : 1.212e-06
##
## Sensitivity : 0.9962
## Specificity : 0.2143
## Pos Pred Value : 0.9797
## Neg Pred Value : 0.6000
## Prevalence : 0.9744
## Detection Rate : 0.9707
## Detection Prevalence : 0.9908
## Balanced Accuracy : 0.6053
##
## 'Positive' Class : No
##
Results
Although the overall accuracy of the model is high, it is clear that its accuracy in terms of correctly predicting the ‘Yes’ level of the s33 variable is fairly poor. Thus, we should investigate options for improving this performance.
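Before turning to regularization, one simple lever worth sketching is the decision threshold itself: lowering it below 0.5 trades false positives for sensitivity on the rare ‘Yes’ class. This is illustrative only (the threshold values are arbitrary choices), and the adjustment is not used in the analyses below.

#sensitivity for the "Yes" class at progressively lower thresholds (sketch)
prob.glm <- predict(model.glm, type = "response")
for (thr in c(0.5, 0.25, 0.1)) {
  pred <- factor(ifelse(prob.glm > thr, "Yes", "No"), levels = c("No", "Yes"))
  cat("threshold", thr, "- sensitivity for Yes:",
      round(mean(pred[train.cv$s33 == "Yes"] == "Yes"), 3), "\n")
}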
Building on the logistic regression model, we can try and improve its performance by introducing penalties to different coefficients of the model. This can prevent overfitting and aid with classification performance.
The first regularization method we will try is ridge regression. In this case, we perform L2 regularization, meaning we impose a penalty proportional to the sum of the squared magnitudes of the coefficients. As such, coefficients are shrunk toward zero but never reach it exactly.
We can tune the magnitude of the penalty imposed by ridge regression by using cross-validation over different values of the parameter λ.
As λ increases, the flexibility of the ridge regression fit decreases, leading to decreased variance but increased bias.
By performing 10-fold cross-validation, we find the optimal value of λ (that which minimizes misclassification error) to be:
lambda.list.ridge = 1000 * exp(seq(0, log(1e-5), length = 100))
train.neurons <- train.cv %>% select(V1:V34)
train.labels <- as.factor(train.cv$s33)
test.neurons <- test.cv %>% select(V1:V34)
test.labels <- as.factor(test.cv$s33)
set.seed(1)
model.ridge.cv=cv.glmnet(as.matrix(train.neurons),train.labels, alpha = 0,nfolds = 10, lambda = lambda.list.ridge, family="binomial", type.measure ='class')
lambdaSelect <- model.ridge.cv$lambda.min
lambdaSelect
## [1] 0.1629751
The minimal misclassification error achieved by this optimal λ is:
model.ridge.cv$cvm[which (lambda.list.ridge==lambdaSelect)]
## [1] 0.02472527
#Plot showing how different values of lambda affect the misclassification error
plot(model.ridge.cv)
abline(v = log(lambdaSelect))
From the above plot, we can see that the model selected by ridge regression performs just slightly better than an extremely rigid model (higher values of log(λ)), such as our hypothetical model that indiscriminately predicts ‘No.’
Below we see how different values of λ affect the coefficients of the resulting model (the vertical line represents the value of λ we selected).
plot(model.ridge.cv$glmnet.fit, "lambda", label=FALSE)
abline(v = log(lambdaSelect))
As we discussed previously, decomposing our error via a confusion matrix is a particularly useful way of assessing model performance in this case:
pred.ridge=predict(model.ridge.cv,s=lambdaSelect, newx = as.matrix(test.neurons), type="response")
pred.ridge = as.factor(ifelse(pred.ridge>0.5,"Yes","No")[,1])
confusionMatrix(data = pred.ridge, reference = test.labels)
## Warning in confusionMatrix.default(data = pred.ridge, reference = test.labels):
## Levels are not in the same order for reference and data. Refactoring data to
## match.
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 912 24
## Yes 0 0
##
## Accuracy : 0.9744
## 95% CI : (0.9621, 0.9835)
## No Information Rate : 0.9744
## P-Value [Acc > NIR] : 0.554
##
## Kappa : 0
##
## Mcnemar's Test P-Value : 2.668e-06
##
## Sensitivity : 1.0000
## Specificity : 0.0000
## Pos Pred Value : 0.9744
## Neg Pred Value : NaN
## Prevalence : 0.9744
## Detection Rate : 0.9744
## Detection Prevalence : 1.0000
## Balanced Accuracy : 0.5000
##
## 'Positive' Class : No
##
The ridge model has failed to classify any of the positive cases as positive, thus it has not improved on the result obtained by logistic regression. In fact, this model predicted ‘No’ for all of the test data, the exact pitfall we were concerned about.
The second regularization method we can use is LASSO (Least Absolute Shrinkage and Selection Operator).
Compared to ridge, lasso penalizes the absolute values (L1 norm) of the coefficients rather than their squares. Consequently, lasso can eliminate some coefficients altogether, effectively performing feature selection.
Here we also operate with a parameter λ that can be subject to tuning. Using 10-fold cross-validation we arrive at the following λ value:
lambdaSelect = NULL
lambda.list.lasso = 1000 * exp(seq(0, log(1e-5), length = 100))
train.neurons <- train.cv %>% select(V1:V34)
train.labels <- train.cv$s33
test.neurons <- test.cv %>% select(V1:V34)
test.labels <- test.cv$s33
set.seed(1)
model.lasso.cv=cv.glmnet(as.matrix(train.neurons),train.labels, alpha = 1,nfolds = 10, lambda = lambda.list.lasso, family="binomial")
lambdaSelect <- model.lasso.cv$lambda.min
lambdaSelect
## [1] 0.01
The minimal cross-validation error achieved by fine-tuning the value of λ (note that, unlike for ridge above, no type.measure was specified, so cv.glmnet reports its default metric, the binomial deviance, rather than the misclassification rate) is:
model.lasso.cv$cvm[which(lambda.list.lasso==lambdaSelect)]
## [1] 0.1893284
We can also find out the number of coefficients that are not zero when using the above-selected value of λ:
paste(model.lasso.cv$nzero[which(lambda.list.lasso==lambdaSelect)], "coefficients are non-zero")
## [1] "6 coefficients are non-zero"
Plot showing how different values of λ are reflected in the coefficients of the resulting model (the vertical line represents the value of λ we selected):
plot(model.lasso.cv$glmnet.fit, "lambda", label=FALSE)
abline(v = log(lambdaSelect))
Again, we construct a confusion matrix in order to look at how the model performs
pred.lasso=predict(model.lasso.cv,s=lambdaSelect, newx = as.matrix(test.neurons), type="response")
pred.lasso = as.factor(ifelse(pred.lasso>0.5,"Yes","No")[,1])
confusionMatrix(data= pred.lasso, reference = as.factor(test.labels))
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 909 24
## Yes 3 0
##
## Accuracy : 0.9712
## 95% CI : (0.9583, 0.9809)
## No Information Rate : 0.9744
## P-Value [Acc > NIR] : 0.7702472
##
## Kappa : -0.0057
##
## Mcnemar's Test P-Value : 0.0001186
##
## Sensitivity : 0.9967
## Specificity : 0.0000
## Pos Pred Value : 0.9743
## Neg Pred Value : 0.0000
## Prevalence : 0.9744
## Detection Rate : 0.9712
## Detection Prevalence : 0.9968
## Balanced Accuracy : 0.4984
##
## 'Positive' Class : No
##
As with ridge, lasso fails to correctly identify any of the sound 33 observations; furthermore, it also misclassifies a few other sounds as sound 33.
Based on these results, imposing penalties on the coefficients or reducing the number of features does not improve the overall performance of the model. Seemingly, all of the features are needed in order to represent the neuronal response we wish to categorize.
The next class of supervised machine learning models we wish to try are decision trees. Often compared to the “game of 20 questions,” decision trees are highly successful in both regression and classification problems, provide an understandable, even “palatable,” graphical representation of the model, and perform well on large datasets.
However, there are several limitations that need to be considered. In our case we will have to be particularly cautious of the danger of overfitting, as trees can be very non-robust against changes in data. Our data is inherently variable from trial to trial, and therefore could prove non-ideal for tree application.
First, we train a regular decision tree model
model.tree <- tree(s33 ~ ., data=train.cv)
summary(model.tree)
##
## Classification tree:
## tree(formula = s33 ~ ., data = train.cv)
## Variables actually used in tree construction:
## [1] "V33" "V34" "V31" "V16" "V9" "V4" "V11" "V5" "V32" "V7" "V19" "V15"
## Number of terminal nodes: 20
## Residual mean deviance: 0.09964 = 215.6 / 2164
## Misclassification error rate: 0.01877 = 41 / 2184
This model produces a low misclassification error rate. However, given how our data is structured, we will need to investigate further.
We can try using cross-validation to prune branches from the tree:
cv <- cv.tree(model.tree, K=10, FUN = prune.misclass)
best.cv = min(cv$size[cv$dev == min(cv$dev)])
paste("the minimum size of our best performing tree is ", best.cv)
## [1] "the minimum size of our best performing tree is 1"
Unfortunately, the lowest misclassification error is achieved by a tree of size 1, i.e., a single terminal node. Pruning down to this size collapses the model into a trivial classifier: the root becomes the only leaf, and every observation receives the majority-class prediction.
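Plotting the cross-validation results makes this degenerate behavior visible (a sketch):

#CV misclassification count as a function of tree size
plot(cv$size, cv$dev, type = "b",
     xlab = "tree size (terminal nodes)", ylab = "CV misclassifications")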
Inspired by our previous work, we now look at the confusion matrix of the original model.
prob.training.tree=NULL
prob.training.tree = predict(model.tree, test.cv, type="class")
confusionMatrix(data = prob.training.tree, reference = test.cv$s33)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 908 23
## Yes 4 1
##
## Accuracy : 0.9712
## 95% CI : (0.9583, 0.9809)
## No Information Rate : 0.9744
## P-Value [Acc > NIR] : 0.770247
##
## Kappa : 0.0607
##
## Mcnemar's Test P-Value : 0.000532
##
## Sensitivity : 0.99561
## Specificity : 0.04167
## Pos Pred Value : 0.97530
## Neg Pred Value : 0.20000
## Prevalence : 0.97436
## Detection Rate : 0.97009
## Detection Prevalence : 0.99466
## Balanced Accuracy : 0.51864
##
## 'Positive' Class : No
##
Unfortunately, the non-pruned tree model has not been able to classify positive cases of sound 33 with significantly more success than any of the previously fitted models.
Since we have shown that it is challenging for a single decision tree to perform our classification task, we will next try to combine a large number of “weak learner” trees into an aggregated model, built sequentially so that each new tree corrects the errors of the ensemble so far. This method is called boosting.
We train a boosted tree model using 10000 trees and evaluate its performance using a confusion matrix.
set.seed(123) # for reproducibility
model.gbm <- gbm(ifelse(s33=="Yes",1,0) ~ ., data = train.cv, n.trees = 10000, shrinkage = 0.01)
## Distribution not specified, assuming bernoulli ...
prob.training.boost=NULL
prob.training.boost = as.factor(ifelse(predict(model.gbm, test.cv, type="response")>0.5,"Yes","No"))
confusionMatrix(data = prob.training.boost, reference = test.cv$s33)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 903 20
## Yes 9 4
##
## Accuracy : 0.969
## 95% CI : (0.9558, 0.9792)
## No Information Rate : 0.9744
## P-Value [Acc > NIR] : 0.87082
##
## Kappa : 0.2018
##
## Mcnemar's Test P-Value : 0.06332
##
## Sensitivity : 0.9901
## Specificity : 0.1667
## Pos Pred Value : 0.9783
## Neg Pred Value : 0.3077
## Prevalence : 0.9744
## Detection Rate : 0.9647
## Detection Prevalence : 0.9861
## Balanced Accuracy : 0.5784
##
## 'Positive' Class : No
##
According to the number of correctly classified positive cases, boosted trees have produced results slightly better than every model except logistic regression; however, this comes at the expense of an increased false positive count.
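As a side note, gbm reports each variable’s relative influence, which hints at which neurons drive the boosted model (a sketch, not part of the evaluation above):

#top 5 neurons by relative influence in the boosted model
head(summary(model.gbm, plotit = FALSE), 5)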
The last class of decision tree-based models is the random forest. This method, much like boosting, combines many individual decision trees. However, in a random forest the trees are decorrelated: each tree is trained on a bootstrap sample of the dataset, and each split considers only a random subset of the predictors, so the segregating nodes defining branches differ from tree to tree. This can mitigate the challenges posed to single decision trees, such as overfitting and non-robustness.
model.rfor = randomForest(x = train.neurons, y = train.labels, xtest = test.neurons, ytest = test.labels, ntree = 10000, strata = train.labels, replace = TRUE, importance = TRUE, mtry = 30)
Since the randomForest() function can take in the test dataset (via xtest/ytest), the fitted model object contains confusion matrices. The one below is based on out-of-bag predictions on the training data:
model.rfor$confusion
## No Yes class.error
## No 2112 16 0.007518797
## Yes 51 5 0.910714286
Based on the confusion matrix, the random forest method was unable to outperform the logistic regression.
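Since importance = TRUE was set when fitting, the forest’s variable importance can still be inspected (a sketch), even though the classifier itself underperformed:

#top 10 neurons by importance (mean decrease in accuracy / Gini)
varImpPlot(model.rfor, n.var = 10, main = "Random forest variable importance")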
While very successful in many machine learning problems due to their ability to implement non-linear decision surfaces, the performance of decision trees was not superior to other models in our case.
This was possibly caused by the noise in our data and very large trial-to-trial variability. By trying to determine which variables and their combinations help best in the classification of the data, decision tree-based models might lose the more “holistic” view of the population activity, which may be required in order to robustly learn the ever-changing representation.
The last model class we wish to try are support-vector machines (SVMs). This type of model uses support vectors determined by the training data in order to compute an optimal separating hyperplane. In addition, SVMs are capable of performing non-linear classification by transforming a low-dimensional input space into a higher dimension, which could in some cases help with separation (such as the two circular classes of data we saw in lecture).
In order to use the best model possible, we first tune the budget (cost) and gamma parameters.
We use the function tune() from the e1071 library to do so, specifying the number of cross-validation folds and enabling random sampling.
obj <- tune(svm, s33~.,
data = train.cv,
ranges = list(gamma = 2^(-5:5), cost = 2^(1:8)),
tunecontrol = tune.control(cross = 5, nboot = 10,sampling = "boot")
)
summary.svm<-summary(obj)
best.gamma.svm<-summary.svm$best.parameters[,1]
best.cost.svm<-summary.svm$best.parameters[,2]
paste("our optimal gamma from tuning is ", best.gamma.svm)
## [1] "our optimal gamma from tuning is 4"
And the best budget (cost) based on tuning:
paste("our budget (cost) from tuning is ", best.cost.svm)
## [1] "our budget (cost) from tuning is 2"
Now we can train the model…
model.svm = svm(formula = s33 ~ .,
data = train.cv,
type = 'C-classification',
kernel = 'polynomial',
cost = best.cost.svm,
gamma = best.gamma.svm
)
As in all of the previous cases, we evaluate the performance of the model using a confusion matrix
pred.svm = predict(model.svm, newdata = test.cv)
confusionMatrix(data = pred.svm, reference = test.cv$s33)
## Confusion Matrix and Statistics
##
## Reference
## Prediction No Yes
## No 875 21
## Yes 37 3
##
## Accuracy : 0.938
## 95% CI : (0.9206, 0.9526)
## No Information Rate : 0.9744
## P-Value [Acc > NIR] : 1.00000
##
## Kappa : 0.0637
##
## Mcnemar's Test P-Value : 0.04888
##
## Sensitivity : 0.9594
## Specificity : 0.1250
## Pos Pred Value : 0.9766
## Neg Pred Value : 0.0750
## Prevalence : 0.9744
## Detection Rate : 0.9348
## Detection Prevalence : 0.9573
## Balanced Accuracy : 0.5422
##
## 'Positive' Class : No
##
SVMs did not perform better than the previously applied models; with an overall accuracy of 93.8%, the SVM in fact performed noticeably worse than our indiscriminately ‘No’-predicting model would. Unlike our other fitted models, here we see a markedly higher false positive rate.
We speculate that the ineffectiveness of our SVM model may be the result of the relatively high sensitivity of SVMs to noise. As previously mentioned, our data is inherently noisy and sparse, which may mean single outliers have a large influence on the computation of the optimal separating hyperplane even though the budget parameter is set to its optimal value.
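One untested mitigation (a sketch only, not fitted here) would be to up-weight the rare class via e1071’s class.weights argument; the 38:1 weighting below simply mirrors the class imbalance and is an assumption, not a tuned value:

model.svm.weighted = svm(formula = s33 ~ .,
                         data = train.cv,
                         type = 'C-classification',
                         kernel = 'polynomial',
                         cost = best.cost.svm,
                         gamma = best.gamma.svm,
                         #weight "Yes" by the approximate class ratio (assumption)
                         class.weights = c(No = 1, Yes = 38))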
We have analyzed a real-world calcium imaging dataset using methods of statistical machine learning, in order to demonstrate the existence of stable stimulus representation in neuronal activity.
After a significant amount of analysis performed within MATLAB (due to the limited availability of fluorescence processing tools in R), we were able to import and reformat a dataset representing recordings of neuronal activity of single cells within the murine auditory cortex.
Using unsupervised learning, we were able to select a candidate sound of interest (sound 33) in a specific field of view (f34c) by assessing its capacity to be clustered into a non-general mode of the population response. As such, we have succeeded in completing our first experimental aim.
Next, we tried fitting multiple model classes and strategies in order to train a classifier able to distinguish between the sound of interest and the rest of the stimuli.
We report that the logistic regression model displayed the best performance in terms of correctly classifying the positive case of population activity elicited by sound 33.
Although not an ideal classifier, given the nature of the data and its high trial-to-trial variability, this can be regarded as a positive result suggesting that the population response to a specific sound is something a machine learning model can be trained to recognize, thus adding relevance to the ideas presented in the theoretical introduction.
The reason for this result might be, hypothetically, that logistic regression tends toward higher bias and lower variance: it makes stronger assumptions about the data and is therefore less prone to overfitting. Thus, it may be able to take a more “holistic” approach without being overly dependent on the quality (in terms of noise or sparsity, for example) of the training dataset.
On the other hand, decision trees and SVMs are more prone to be negatively influenced by noisy data as they rely on quality training sets. They are able to capture non-linearity within the data, but this advantage is perhaps rendered obsolete by the lower quality of the training data.